Multi-Paxos
Learn about the practical usage of the consensus algorithm, recognize the limitations of Paxos, and see how to mitigate them via multi-Paxos.
The practical use of Paxos#
Consensus algorithms are a valuable tool for building fault-tolerant systems, but how exactly they facilitate fault tolerance is often unclear. This lesson aims to make it clear.
We know that the fundamental approach to achieving fault tolerance is replication, which keeps multiple copies of a system so that the failure of some copies can be masked. The main challenge in a replicated system is keeping those copies consistent. Inconsistency arises when replicas execute client requests (operations that change the system's state) in different orders. For all replicas to stay consistent with each other, we need to maintain a single order of operations that is the same across all replicas: when each replica executes the operations in the same order, they all reach the same state. A consensus algorithm helps us establish this common order of operations, which also means that all replicas mutually agree on the state of the system.
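To see why order matters, here is a minimal sketch (not the lesson's code; the operation names are hypothetical) of two replicas applying the same operations to an account balance. The replicas agree only when they apply the operations in the same order:

```python
# Why operation order matters for replicated state machines:
# replicas that apply the same ops in the same order reach the same state.

def apply_ops(initial_balance, ops):
    """Apply a sequence of (operation, amount) pairs to an account balance."""
    balance = initial_balance
    for op, amount in ops:
        if op == "deposit":
            balance += amount
        elif op == "set":      # overwrite the balance
            balance = amount
    return balance

ops = [("deposit", 50), ("set", 100)]

# Same order on every replica -> same final state.
replica_1 = apply_ops(0, ops)
replica_2 = apply_ops(0, ops)
assert replica_1 == replica_2 == 100

# Different orders -> divergent replicas: exactly the inconsistency
# that consensus on the operation order prevents.
replica_3 = apply_ops(0, list(reversed(ops)))
assert replica_3 == 150
```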
In this lesson, we will use the Paxos consensus algorithm to generate an order of operations that is the same across all system replicas, as shown in the following illustration. Each replica persistently records this order in a local log file called the operation log, which also helps lagging replicas catch up with the leading ones.
We know that a single instance of Paxos chooses a single value, so a single value can't give us consensus on a whole sequence of operations that every replica must execute in the same order. We need to run multiple instances of the basic Paxos algorithm to reach consensus on the order of multiple client operations (the values proposed in Paxos). Running multiple instances of Paxos this way is called multi-Paxos. Let's see how it works with the help of the following example:
Implementing replication with Paxos#
To construct a replicated operation log through consensus on a sequence of values (transactions or operations), the Paxos consensus algorithm is executed repeatedly: one instance of Paxos per transaction, that is, per entry in the operation log.
Let's say we have three replicas of a system, as shown in the following illustration. Clients can submit transactions to any replica. There are three clients, A, B, and C, each requesting a transaction on a different replica: replica 1 receives a transaction request from client A, replica 2 from client B, and replica 3 from client C. On receiving a client's transaction request, each replica has to find the correct index in the operation log for that transaction, so that the transaction lands at the same index in all replicas' local operation logs.
Each replica needs to select a log entry for the transaction it receives. Let's see how the replica finds the log entry for a transaction in the following section.
Selecting a log entry for a transaction#
When a replica receives a transaction request from a client, it doesn't know which log entry is free in every replica's local operation log, i.e., where it can place that transaction consistently. So it runs the Paxos protocol for the first entry in its operation log for which no value has been chosen. The Paxos instance for that entry chooses one of the proposed values; other replicas may be running Paxos for the same entry, each proposing the transaction it received from its own client. The transactions that were not chosen for that entry are retried for the next entry for which no value is known to be chosen. Each instance chooses exactly one value, so Paxos runs as many times as there are transactions, as shown in the following illustration:
In the above illustration, we assume that each replica received a client request at the same time. The first log entry in each replica's operation log is free; no value is known to be chosen for it. Each replica runs the Paxos protocol to get its proposed value (the transaction it received from its client) chosen for the first log entry. The proposing replica includes the log index in the prepare requests it sends to other replicas, to identify which Paxos instance it is proposing in. The Paxos instance for log entry 1 chooses a single value for that entry, which is client B's transaction in the illustration above.
Note: The prepare requests now look like prepare(log index, proposal number), and the accept requests look like accept(log index, proposal number, transaction).
Replica 2 has found the log index for the transaction request it received, but replicas 1 and 3 have not yet found log indexes for their transactions. So they look for the next log entry in their operation logs for which no value is known to be chosen; for both, that is log entry 2. They run Paxos for log entry 2. Replica 1 gets a majority of replicas behind its proposal, and its proposed value (client A's transaction) is accepted and chosen. The Paxos instance for log entry 2 thus chooses client A's transaction for that entry.
Replica 1 has now also found the log index for its transaction, while replica 3 still has not. Replica 3 tries again on the next free entry, index 3, and its proposed value (client C's transaction) is chosen for that index.
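The retry loop above can be sketched as follows. This is an illustrative simulation, not the lesson's implementation: `run_paxos` stands in for a full Paxos round and simply returns whichever value the (simulated) majority chose for a slot.

```python
# Hypothetical sketch of how a replica picks a log slot for a transaction:
# run Paxos on the first unchosen slot; if another value wins there,
# retry on the next unchosen slot.

def first_unchosen_slot(log):
    """Index of the first slot with no chosen value (1-based)."""
    slot = 1
    while log.get(slot) is not None:
        slot += 1
    return slot

def place_transaction(log, txn, run_paxos):
    """Keep proposing `txn` until some slot chooses it; record losing slots too."""
    while True:
        slot = first_unchosen_slot(log)
        chosen = run_paxos(slot, txn)   # returns whichever value was chosen
        log[slot] = chosen
        if chosen == txn:
            return slot

# Simulate the lesson's scenario: slot 1 chooses B's transaction, so
# replica 1's transaction ends up in slot 2.
outcomes = {1: "txn_B"}
log = {}
slot = place_transaction(log, "txn_A", lambda s, v: outcomes.get(s, v))
assert slot == 2 and log == {1: "txn_B", 2: "txn_A"}
```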
Since we are allowing multiple concurrent proposers, this leads to dueling proposers, the same problem we discussed in the Basic Paxos in Action lesson. We can avoid it by electing a distinguished proposer. Now, with a single proposer at a time, do we still need to send separate prepare requests for each transaction the proposer runs a Paxos instance for? Let's look at an optimization mechanism that reduces the number of messages passed between replicas when running multiple instances of Paxos.
Note: Proposers might use randomized wait times when they detect dueling proposers. Doing so can expedite the consensus progress.
Optimization by chaining multiple Paxos instances together#
The Paxos algorithm requires each sender (proposers, acceptors, and learners) to save its state to disk before sending messages. This results in five disk writes per Paxos instance, one for each message (prepare, promise, accept, accepted, and commit). These writes must be flushed to disk before the system can proceed. If replicas are in close network proximity, the time to flush to disk can dominate the network latency in the system.
In order to reduce the number of messages involved in the Paxos algorithm, it is possible to chain multiple instances together. If the distinguished proposer's identity remains the same between Paxos instances, the prepare messages for all the transactions at that time can be reduced to just one prepare message. However, if the distinguished proposer's identity changes between Paxos instances, the new proposer must send a separate prepare request to proceed with its proposal. By using a multi-Paxos algorithm with a long-lived distinguished proposer, one prepare message can cover a large number of transactions. The distinguished proposer writes to disk after sending its prepare/accept messages, while other replicas write to disk before sending their promise/accepted messages. This allows multiple Paxos instances to execute in parallel and reduces the number of disk writes required per Paxos instance.
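The message saving can be sketched as follows. This is a minimal illustration under assumed names (the `Leader` class and its message trace are not from the text): a stable distinguished proposer sends one prepare for its whole leadership term, then only accept messages, one per log slot.

```python
# A stable leader skips the prepare phase for subsequent slots once a
# majority has promised its proposal number.

class Leader:
    def __init__(self, proposal_number):
        self.n = proposal_number
        self.prepared = False   # has a majority already promised number n?
        self.sent = []          # trace of messages "sent"

    def propose(self, slot, txn):
        if not self.prepared:
            # One prepare covers every later slot while leadership is stable.
            self.sent.append(("prepare", self.n))
            self.prepared = True
        self.sent.append(("accept", slot, self.n, txn))

leader = Leader(proposal_number=7)
for slot, txn in enumerate(["txn_A", "txn_B", "txn_C"], start=1):
    leader.propose(slot, txn)

# One prepare message total, then one accept per log slot.
assert [m[0] for m in leader.sent] == ["prepare", "accept", "accept", "accept"]
```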
Point to ponder
Question
Do the Paxos instances for different log slots use a proposal number higher than the previous slot?
No. Each log slot uses an independent instance of the Paxos algorithm without interfering with other slots. This is why many concurrent instances of Paxos might be in progress to get a value chosen. Remember that, for the log, we have an extra argument of the slot number to disambiguate messages for different slots.
Note: In multi-Paxos, we can substantially speed up the consensus process by using many parallel Paxos instances.
To optimize it more, we can group the transactions (values in Paxos) submitted by different clients together and process them as a single unit within a single Paxos instance. This can reduce the number of individual Paxos instances required to process each value, resulting in more efficient use of system resources and increased throughput.
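Batching can be sketched in a few lines. This is an assumption-laden illustration (the helper name and batch size are hypothetical): each batch becomes a single Paxos value occupying one log slot.

```python
# Group pending transactions into batches; each batch is committed by a
# single Paxos instance, reducing the number of instances needed.

def make_batches(pending, batch_size):
    """Split pending transactions into consecutive batches."""
    return [pending[i:i + batch_size]
            for i in range(0, len(pending), batch_size)]

pending = ["txn_A", "txn_B", "txn_C", "txn_D", "txn_E"]
batches = make_batches(pending, batch_size=2)

# Five transactions now need only three Paxos instances instead of five.
assert batches == [["txn_A", "txn_B"], ["txn_C", "txn_D"], ["txn_E"]]
```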
Note: There are countless optimizations to the Paxos algorithm, especially multi-Paxos. Many of them have not been formally proved correct. It is quite likely that you will encounter one such instance in some software product. An extra degree of caution is required to use such an instance of Paxos because it might break some of the safety requirements in some edge cases.
Conclusion#
Paxos is a commonly used consensus algorithm. It took engineers many years to iron out all the details of the algorithm after its publication by Leslie Lamport. On the surface, Paxos seems relatively simple, but it is intriguing how cleverly it uses logical clocks to totally order the requests at all replicas and finally reach a chosen state for those requests.